A transformer model operates on time series or sequential data using a form of attention: it identifies past states/tokens that are particularly related to the current token and gives these extra weight when predicting future tokens. This is like the human ability to hear or read a sentence such as "The cat sat in the deep orange glow of sunset and licked its fur" – when you reach the word 'licked', your mind instantly pulls out the word 'cat' as related and uses it to make sense of the current point in the sentence. We use rich semantic structures to do this, but transformer models compute a vector similarity between a 'key' and a 'query' for each token, where the key models the kind of thing that the input token is, and the query the kind of thing it would like to connect with.
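As a rough sketch of the key/query mechanism, the short NumPy example below computes single-head scaled dot-product attention over a toy sequence; the dimensions, random weight matrices, and the attention function itself are illustrative assumptions rather than code from any particular model.

    # Minimal single-head scaled dot-product attention (illustrative sketch).
    # In a real transformer the projection matrices W_q, W_k, W_v are learned.
    import numpy as np

    def attention(X, W_q, W_k, W_v):
        Q = X @ W_q                      # queries: what each token looks for
        K = X @ W_k                      # keys: what each token offers
        V = X @ W_v                      # values: the content to be mixed
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # query-key similarity
        # Softmax over the sequence; decoder-style models additionally
        # mask out future positions so each token attends only to the past.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V               # weighted sum of related tokens

    rng = np.random.default_rng(0)
    X = rng.standard_normal((12, 8))     # 12 tokens, embedding size 8
    W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
    out = attention(X, W_q, W_k, W_v)    # out[i] blends tokens related to token i

In the sentence example above, the row of weights for 'licked' would place high probability on 'cat', so the output vector for 'licked' is largely built from the representation of 'cat'.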
Used in Chap. 14: pages 231, 237; Chap. 19: pages 318, 319; Chap. 22: page 369; Chap. 23: page 392; Chap. 24: page 400
Also known as transformer networks